Data challenge - churn prevention for Revolut

Prepare packages

We're reading below standard data science toolkit, including tools for data processing (e.g. dask - "pandas in parallel"), visualization tools (like e.g. plotly, seaborn) and some machine-learning frameworks (scikit-learn, and similar).

System packages

In [1]:
import multiprocessing
In [2]:
import sys
import os
In [133]:
from datetime import datetime
In [3]:
module_path = os.path.abspath(os.path.join('../code/'))
if module_path not in sys.path:
    sys.path.append(module_path)

Data science toolkit

In [4]:
import numpy as np
import pandas as pd
import dask.dataframe as dd

Machine learning

In [5]:
import featuretools as ft

Visualizations

In [6]:
import seaborn as sns
In [40]:
import plotly.express as px

Other

In [7]:
from pandas_profiling import ProfileReport
In [8]:
from rnormalize import normalize_user_id
In [43]:
from rtools import get_user_plan

Read data

This Jupyter notebook follows rules of reproducible resarch, meaning it can be re-run (after preparing the environment, as defined in requirements.txt) - resulting in similar values in the output and exactly same business conclussions. The only requirement is to have the same file/dir structure - notebook should be located in the notebooks directory, above, we should have data in the data directory, and some additional sources in the code directory.

file location

In [9]:
devices_filename = '../data/rev-devices.csv'
notifications_filename = '../data/rev-notifications.csv'
users_filename = '../data/rev-users.csv'
transactions_filename = '../data/rev-transactions.csv'

read csv files

In [10]:
devices = dd.read_csv(devices_filename)
f'File {devices_filename} was successfully read'
Out[10]:
'File ../data/rev-devices.csv was successfully read'
In [11]:
notifications = dd.read_csv(notifications_filename)
f'File {notifications_filename} was successfully read'
Out[11]:
'File ../data/rev-notifications.csv was successfully read'
In [12]:
users = dd.read_csv(users_filename)
f'File {users_filename} was successfully read'
Out[12]:
'File ../data/rev-users.csv was successfully read'
In [13]:
transactions = dd.read_csv(transactions_filename)
f'File {transactions_filename} was successfully read'
Out[13]:
'File ../data/rev-transactions.csv was successfully read'

Data book

The objective of the Data Book is to provide useful summaries of data, which are easily understood by everyone.

  1. table: devices.csv

    a table of devices associated with a user

    • brand​: string corresponding to the phone brand
    • user_id​: string uniquely identifying the user
  1. table: users.csv

    a table of user data

    • user_id​: string uniquely identifying the user
    • birth_year​: integer corresponding to the user’s birth year
    • country​: two letter string corresponding to the user’s country of residence
    • city​: two string corresponding to the user’s city of residence
    • created_date​: datetime corresponding to the user’s created date
    • user_settings_crypto_unlocked​: integer indicating if the user has unlocked the crypto currencies in the app
    • plan​: string indicating on which plan the user is on
    • attributes_notifications_marketing_push​: float indicating if the user has accepted to receive marketing push notifications
    • attributes_notifications_marketing_email​: float indicating if the user has accepted to receive marketing email notifications
    • num_contacts​: integer corresponding to the number of contacts the user has on Revolut
    • num_referrals​: integer corresponding to the number of users referred by the selected user
    • num_successful_referrals​: integer corresponding to the number of users successfully referred by the selected user (successfully means users who have actually installed the app and are able to use the product)
  1. table: notifications.csv

    a table of notifications that a user has received

    • reason​: string indicating the purpose of the notification
    • channel​: string indicating how the user has been notified
    • status​: string indicating the status of the notification
    • user_id​: string uniquely identifying the user
    • created_date​: datetime indicating when the notification has been sent
  1. table: transactions.csv

    a table with transactions that a user made

    • transaction_id​: string uniquely identifying the transaction
    • transactions_type​: string indicating the type of the transaction
    • transactions_currency​: string indicating the currency of the transaction
    • amount_usd​: float corresponding to the transaction amount in USD
    • transactions_state​: string indicating the state of a transaction
    • COMPLETED - the transaction was completed and the user's balance was changed
    • DECLINED/FAILED - the transaction was declined for some reason, usually pertains to insufficient balance
    • REVERTED - the associated transaction was completed first but was then rolled back later in time potentially due to customer reaching out to Revolut
    • ea_cardholderpresence​: string indicating if the card holder was present when the transaction happened
    • ea_merchant_mcc​: float corresponding to the Merchant Category Code (MCC)
    • ea_merchant_city​: string corresponding to the merchant’s city
    • ea_merchant_country​: string corresponding to the merchant’s country
    • direction​: string indicating the direction of the transaction
    • user_id​: string uniquely identifying the user
    • created_date​: datetime corresponding to the transaction’s created date

First glance at the data

We're checking here what's the size of data and how many columns we got

Data volume

In [14]:
f'There are {devices.shape[0].compute()} rows in the "devices" table'
Out[14]:
'There are 19430 rows in the "devices" table'
In [15]:
f'There are {notifications.shape[0].compute()} rows in the "notifications" table'
Out[15]:
'There are 121813 rows in the "notifications" table'
In [16]:
f'There are {users.shape[0].compute()} rows in the "users" table'
Out[16]:
'There are 19430 rows in the "users" table'
In [17]:
print(f'''There are {transactions.shape[0].compute()} rows in the "transactions" table. 
      The transactions table is {int(transactions.shape[0].compute() / users.shape[0].compute())} times bigger 
      than the number of users.''')
There are 2740075 rows in the "transactions" table. 
      The transactions table is 141 times bigger 
      than the number of users.

Columns and their data types

In [18]:
devices.dtypes
Out[18]:
brand      object
user_id    object
dtype: object

This is rather self-explanatory. It tells what type of device user is using (column - brand).

In [19]:
notifications.dtypes
Out[19]:
reason          object
channel         object
status          object
user_id         object
created_date    object
dtype: object

This is an interesting table. We're especially thrilling to investigate what type of messages are sent to the users (column - reason). Probably they're mostly "push" messages.

In [20]:
users.dtypes
Out[20]:
user_id                                      object
birth_year                                    int64
country                                      object
city                                         object
created_date                                 object
user_settings_crypto_unlocked                 int64
plan                                         object
attributes_notifications_marketing_push     float64
attributes_notifications_marketing_email    float64
num_contacts                                  int64
num_referrals                                 int64
num_successful_referrals                      int64
dtype: object

Data on the user. We have information on the type of plan, as well as, some extra data about number of marketing messages.

In [21]:
transactions.dtypes
Out[21]:
transaction_id            object
transactions_type         object
transactions_currency     object
amount_usd               float64
transactions_state        object
ea_cardholderpresence     object
ea_merchant_mcc          float64
ea_merchant_city          object
ea_merchant_country       object
direction                 object
user_id                   object
created_date              object
dtype: object

Transactional data. It should be possible to find transaction type (atm/transfer/online), as well as country of the merchant.

Data profiling

Users

In [22]:
users_df_profile = ProfileReport(users.sample(frac=0.25).compute(), 
                                 progress_bar=False,
                                 title='Pandas Profiling Report ("Users" dataframe)', 
                                 html={'style':{'full_width':True}})
In [159]:
users_df_profile.to_notebook_iframe()  # .to_widgets()
In [24]:
users_df_profile.to_file(output_file="out/users_df_profile_p0.25.html")  # saving the whole report to an html file

Devices

In [25]:
devices_df_profile = ProfileReport(devices.sample(frac=0.25).compute(), 
                                  progress_bar=False,
                                  title='Pandas Profiling Report ("Devices" dataframe)', 
                                  html={'style':{'full_width':True}})
In [160]:
devices_df_profile.to_notebook_iframe()  # .to_widgets()
In [27]:
devices_df_profile.to_file(output_file="out/devices_df_profile_p0.25.html")

Notifications

In [28]:
notifications_df_profile = ProfileReport(notifications.sample(frac=0.1).compute(), 
                                         progress_bar=False,
                                         title='Pandas Profiling Report ("Notifications" dataframe)', 
                                         html={'style':{'full_width':True}})
In [161]:
notifications_df_profile.to_notebook_iframe()  # .to_widgets()
In [30]:
notifications_df_profile.to_file(output_file="out/notifications_df_profile_p0.10.html")

Transactions

In [31]:
transactions_pd = transactions.compute()  # store df in memory
In [32]:
transactions_df_profile = ProfileReport(transactions_pd, 
                                        minimal=True,
                                        pool_size=multiprocessing.cpu_count()-1,
                                        title='Pandas Profiling Report ("Transactions" dataframe)', 
                                        html={'style':{'full_width':True}})





In [162]:
transactions_df_profile.to_notebook_iframe()  # .to_widgets()
In [34]:
transactions_df_profile.to_file(output_file="out/transactions_df_profile_pALL.html")

Data engineering

In [35]:
users_pd = users.compute()
users_pd['user_id'] = users_pd['user_id'].apply(lambda x: normalize_user_id(x)).astype(int)
In [36]:
devices_pd = devices.compute().reset_index().rename(columns={'index': 'device_id'})
devices_pd['user_id'] = devices_pd['user_id'].apply(lambda x: normalize_user_id(x)).astype(int)
In [52]:
notifications_pd = notifications.compute().reset_index().rename(columns={'index': 'notification_id'})
notifications_pd['user_id'] = notifications_pd['user_id'].apply(lambda x: normalize_user_id(x)).astype(int)
In [38]:
transactions_pd['user_id'] = transactions_pd['user_id'].apply(lambda x: normalize_user_id(x)).astype(int)
In [39]:
transactions_pd.memory_usage(deep=True)
Out[39]:
Index                     21920600
transaction_id           207134590
transactions_type        182275087
transactions_currency    164404500
amount_usd                21920600
transactions_state       180410903
ea_cardholderpresence    134482025
ea_merchant_mcc           21920600
ea_merchant_city         141037387
ea_merchant_country      131953088
direction                177574595
user_id                   21920600
created_date             227426225
dtype: int64

Exploratory data analysis

Data understanding

Let's vizualize some data!

Understanding notifications

In [56]:
notifications_pd.reason.value_counts()
Out[56]:
REENGAGEMENT_ACTIVE_FUNDS               35844
FIFTH_PAYMENT_PROMO                     23743
NO_INITIAL_CARD_USE                     11221
NO_INITIAL_CARD_ORDER                   11136
METAL_GAME_START                         8373
LOST_CARD_ORDER                          6146
NO_INITIAL_FREE_PROMOPAGE_CARD_ORDER     5459
JOINING_ANNIVERSARY                      4158
MADE_MONEY_REQUEST_NOT_SPLIT_BILL        3506
ONBOARDING_TIPS_ACTIVATED_USERS          3145
ENGAGEMENT_SPLIT_BILL_RESTAURANT         2298
BLACK_FRIDAY                             2109
METAL_RESERVE_PLAN                       1233
PREMIUM_ENGAGEMENT_INACTIVE_CARD         1124
WELCOME_HOME                              950
PUMPKIN_PAYMENT_NOTIFICATION              690
PREMIUM_ENGAGEMENT_FEES_SAVED             678
Name: reason, dtype: int64
In [73]:
notifications_pd.channel.value_counts()
Out[73]:
EMAIL    61253
PUSH     57683
SMS       2877
Name: channel, dtype: int64

The 'SMS' channel is used quite rarely, so we're going to omit it in our visualizations.

In [77]:
# notifications_plot_data = notifications_pd[notifications_pd.reason.isin(['REENGAGEMENT_ACTIVE_FUNDS',
#                                                                          'FIFTH_PAYMENT_PROMO',
#                                                                          'METAL_GAME_START',
#                                                                          'NO_INITIAL_CARD_USE',
#                                                                          'NO_INITIAL_CARD_ORDER'])].sample(n=1*1000)
notifications_plot_data = notifications_pd[notifications_pd.channel != 'SMS'].sample(n=30*1000)
notifications_plot_data['user_plan'] = notifications_pd['user_id'].apply(lambda x: get_user_plan(x, users_pd))
notifications_plot_data['total_notifications'] = 1
In [78]:
agg_notifications_plot_data = notifications_plot_data.groupby(['reason', 
                                                               'channel', 
                                                               'status', 
                                                               'user_plan']).agg({'total_notifications': 'sum'})
In [79]:
fig = px.bar(agg_notifications_plot_data.reset_index(), 
             x="reason", 
             y='total_notifications',
             color="user_plan", 
             barmode="stack",
             facet_row="channel", facet_col="status")
fig.show()

Understanding transactions

In [85]:
transactions_pd
Out[85]:
transaction_id transactions_type transactions_currency amount_usd transactions_state ea_cardholderpresence ea_merchant_mcc ea_merchant_city ea_merchant_country direction user_id created_date
0 transaction_0 TRANSFER AED 4.55 COMPLETED NaN NaN NaN NaN OUTBOUND 898 2018-04-03 03:34:21.784487
1 transaction_1 CARD_PAYMENT AED 15.50 COMPLETED FALSE 4111.0 Dubai ARE OUTBOUND 1652 2019-03-19 06:15:59.537032
2 transaction_2 CARD_PAYMENT AED 43.40 COMPLETED FALSE 5814.0 Dubai ARE OUTBOUND 1652 2019-03-18 18:53:41.323032
3 transaction_3 TRANSFER AED 10043.01 COMPLETED NaN NaN NaN NaN OUTBOUND 1652 2019-03-22 14:20:01.513032
4 transaction_4 CARD_PAYMENT AED 43.81 COMPLETED FALSE 5651.0 Abu Dhabi ARE OUTBOUND 5509 2019-03-22 05:42:50.316652
... ... ... ... ... ... ... ... ... ... ... ... ...
459322 transaction_2740070 TRANSFER EUR 1.70 COMPLETED NaN NaN NaN NaN OUTBOUND 16319 2019-04-30 10:10:36.810961
459323 transaction_2740071 TRANSFER EUR 17.02 COMPLETED NaN NaN NaN NaN INBOUND 18233 2019-04-30 18:42:45.443443
459324 transaction_2740072 TRANSFER EUR 4.26 COMPLETED NaN NaN NaN NaN OUTBOUND 13991 2019-05-01 11:34:25.078035
459325 transaction_2740073 TRANSFER EUR 1.70 COMPLETED NaN NaN NaN NaN OUTBOUND 12529 2019-04-30 16:51:26.445105
459326 transaction_2740074 TRANSFER EUR 4.26 COMPLETED NaN NaN NaN NaN OUTBOUND 12060 2019-05-01 06:01:59.441309

2740075 rows × 12 columns

In [98]:
transactions_for_plot = transactions_pd.sample(n=10*1000)
transactions_for_plot['amount_usd'] = transactions_for_plot['amount_usd'].apply(lambda x: np.log(x))
In [99]:
g = sns.catplot(x="direction", 
                y="amount_usd",
                hue="transactions_type", 
                col="transactions_state",
                data=transactions_for_plot, kind="strip",
                height=4, aspect=0.5)

Time range of transactional data

In [143]:
print(f'First transaction in the data is from {transactions_pd.created_date.min()}')
print(f' and the last one is from {transactions_pd.created_date.max()}')
max_date = transactions_pd.created_date.max()
First transaction in the data is from 2018-01-01 08:51:10.248709
 and the last one is from 2019-05-16 18:22:16.959896

Feature engineering

Define relations

In [100]:
entities = {"users" : (users_pd, "user_id"),
            "devices" : (devices_pd, "device_id"),  # we really need that index-as-column 
            # https://github.com/FeatureLabs/featuretools/issues/130
            "notifications" : (notifications_pd, "notification_id"),
            "transactions": (transactions_pd, 'transaction_id'),
           }
In [101]:
relationships = [("users", "user_id", "transactions", "user_id"),
                 ("users", "user_id", "notifications", "user_id"),
                 ("users", "user_id", "devices", "user_id")]

Run deep feature synthesis

In [102]:
feature_matrix_users, features_defs = ft.dfs(entities=entities,
                                             relationships=relationships,
                                             target_entity="users",  # we wish to profile users, so it's target
                                             # n_jobs=multiprocessing.cpu_count()-1,
                                             # I'm getting "zmq.error.ZMQError: Too many open files" - MacOS issue?
                                             verbose=True)
Built 67 features
Elapsed: 02:13 | Progress: 100%|██████████
In [103]:
features_defs
Out[103]:
[<Feature: birth_year>,
 <Feature: country>,
 <Feature: city>,
 <Feature: user_settings_crypto_unlocked>,
 <Feature: plan>,
 <Feature: attributes_notifications_marketing_push>,
 <Feature: attributes_notifications_marketing_email>,
 <Feature: num_contacts>,
 <Feature: num_referrals>,
 <Feature: num_successful_referrals>,
 <Feature: SUM(transactions.amount_usd)>,
 <Feature: SUM(transactions.ea_merchant_mcc)>,
 <Feature: STD(transactions.amount_usd)>,
 <Feature: STD(transactions.ea_merchant_mcc)>,
 <Feature: MAX(transactions.amount_usd)>,
 <Feature: MAX(transactions.ea_merchant_mcc)>,
 <Feature: SKEW(transactions.amount_usd)>,
 <Feature: SKEW(transactions.ea_merchant_mcc)>,
 <Feature: MIN(transactions.amount_usd)>,
 <Feature: MIN(transactions.ea_merchant_mcc)>,
 <Feature: MEAN(transactions.amount_usd)>,
 <Feature: MEAN(transactions.ea_merchant_mcc)>,
 <Feature: COUNT(transactions)>,
 <Feature: NUM_UNIQUE(transactions.ea_merchant_country)>,
 <Feature: NUM_UNIQUE(transactions.transactions_state)>,
 <Feature: NUM_UNIQUE(transactions.transactions_type)>,
 <Feature: NUM_UNIQUE(transactions.transactions_currency)>,
 <Feature: NUM_UNIQUE(transactions.ea_cardholderpresence)>,
 <Feature: NUM_UNIQUE(transactions.direction)>,
 <Feature: NUM_UNIQUE(transactions.ea_merchant_city)>,
 <Feature: MODE(transactions.ea_merchant_country)>,
 <Feature: MODE(transactions.transactions_state)>,
 <Feature: MODE(transactions.transactions_type)>,
 <Feature: MODE(transactions.transactions_currency)>,
 <Feature: MODE(transactions.ea_cardholderpresence)>,
 <Feature: MODE(transactions.direction)>,
 <Feature: MODE(transactions.ea_merchant_city)>,
 <Feature: COUNT(notifications)>,
 <Feature: NUM_UNIQUE(notifications.status)>,
 <Feature: NUM_UNIQUE(notifications.channel)>,
 <Feature: NUM_UNIQUE(notifications.reason)>,
 <Feature: MODE(notifications.status)>,
 <Feature: MODE(notifications.channel)>,
 <Feature: MODE(notifications.reason)>,
 <Feature: COUNT(devices)>,
 <Feature: NUM_UNIQUE(devices.brand)>,
 <Feature: MODE(devices.brand)>,
 <Feature: DAY(created_date)>,
 <Feature: YEAR(created_date)>,
 <Feature: MONTH(created_date)>,
 <Feature: WEEKDAY(created_date)>,
 <Feature: NUM_UNIQUE(transactions.WEEKDAY(created_date))>,
 <Feature: NUM_UNIQUE(transactions.YEAR(created_date))>,
 <Feature: NUM_UNIQUE(transactions.DAY(created_date))>,
 <Feature: NUM_UNIQUE(transactions.MONTH(created_date))>,
 <Feature: MODE(transactions.WEEKDAY(created_date))>,
 <Feature: MODE(transactions.YEAR(created_date))>,
 <Feature: MODE(transactions.DAY(created_date))>,
 <Feature: MODE(transactions.MONTH(created_date))>,
 <Feature: NUM_UNIQUE(notifications.WEEKDAY(created_date))>,
 <Feature: NUM_UNIQUE(notifications.YEAR(created_date))>,
 <Feature: NUM_UNIQUE(notifications.MONTH(created_date))>,
 <Feature: NUM_UNIQUE(notifications.DAY(created_date))>,
 <Feature: MODE(notifications.WEEKDAY(created_date))>,
 <Feature: MODE(notifications.YEAR(created_date))>,
 <Feature: MODE(notifications.MONTH(created_date))>,
 <Feature: MODE(notifications.DAY(created_date))>]
In [104]:
feature_matrix_users
Out[104]:
birth_year country city user_settings_crypto_unlocked plan attributes_notifications_marketing_push attributes_notifications_marketing_email num_contacts num_referrals num_successful_referrals ... MODE(transactions.DAY(created_date)) MODE(transactions.MONTH(created_date)) NUM_UNIQUE(notifications.WEEKDAY(created_date)) NUM_UNIQUE(notifications.YEAR(created_date)) NUM_UNIQUE(notifications.MONTH(created_date)) NUM_UNIQUE(notifications.DAY(created_date)) MODE(notifications.WEEKDAY(created_date)) MODE(notifications.YEAR(created_date)) MODE(notifications.MONTH(created_date)) MODE(notifications.DAY(created_date))
user_id
0 1989 PL Gdansk 1 STANDARD 1.0 1.0 3 0 0 ... 19.0 4.0 5.0 2.0 5.0 7.0 1.0 2019.0 2.0 1.0
1 1975 GB London 0 STANDARD NaN NaN 21 0 0 ... 14.0 4.0 3.0 2.0 4.0 6.0 0.0 2018.0 1.0 18.0
2 1987 PL Poznań 0 STANDARD 0.0 0.0 21 0 0 ... 27.0 12.0 5.0 2.0 4.0 6.0 4.0 2019.0 2.0 8.0
3 1994 FR Paris 0 STANDARD 1.0 0.0 0 0 0 ... 9.0 5.0 5.0 2.0 6.0 7.0 0.0 2018.0 2.0 4.0
4 1985 GB Beckenham 0 STANDARD NaN NaN 2 0 0 ... 14.0 2.0 4.0 2.0 4.0 3.0 2.0 2018.0 1.0 23.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19425 1994 IE Dublin 0 STANDARD 1.0 1.0 17 0 0 ... 5.0 4.0 6.0 2.0 4.0 6.0 3.0 2019.0 2.0 28.0
19426 1977 ES Madrid 0 STANDARD 1.0 1.0 0 0 0 ... 11.0 3.0 4.0 1.0 3.0 5.0 5.0 2019.0 2.0 3.0
19427 1991 PL Kraków 0 STANDARD 1.0 1.0 13 0 0 ... 13.0 3.0 4.0 2.0 3.0 4.0 0.0 2019.0 1.0 14.0
19428 1984 GB London 0 STANDARD NaN NaN 0 0 0 ... 15.0 3.0 4.0 2.0 3.0 6.0 2.0 2019.0 2.0 19.0
19429 1995 RO Campia Turzii 0 STANDARD 1.0 1.0 10 0 0 ... 26.0 2.0 4.0 2.0 3.0 3.0 3.0 2018.0 12.0 24.0

19430 rows × 67 columns

Questions

1. Define a target metric to measure user engagement.

How would you define an engaged vs. unengaged user?

Please provide the business justification and associated visualisations / rationale in choosing your definition of engagement.

Propose type of segments

Using my domain knowledge and an intuition (as an active Revolut user), I'd like to segment users into 3 segments:

  1. Super-commited users

Users, who make transactions in their home country. They're not afraid to use Revolut on their daily basis, even for local ATMs and online purchases.

  1. Engaged users

Users, who only make transactions abroad. Their main incentive is to save money on currency exchange spread. They're mostly people with non-premium subscriptions.

  1. Unengaged users

Users, who don't make much tranactions, even when they're abroad. Low activity, low responsivness to marketing.

It's practically impossible (with current state of data) to detect whenever the user goes abroad and doesn't use his/her Revolut actively there. While it's very easy to count number of transactions per country, this doesn't answer the question if the user went abroad (or not). Because of those facts, we're going to propose some other metric. Recently, customer loyalty has been measured by a metric called "RFV". Let's modify it by adding another dimension - response to "push" notifications. Such RFV-R metric should be validated against 3 segments defined above (super-commited, engaged, unengaged) by adjusting importance weights.

References:

https://www.intotheminds.com/blog/en/the-financial-times-found-the-most-perfect-kpis-to-measure-success/

  • Recency – when did they last make a transaction?
  • Frequency – how often do they make transactions?
  • Volume – how many transactions have they made?
  • Responsivity - how long it takes to make a transaction since last notification?

Calculate RFV-R

Because calculating RFV score is computational expensive, we're going to calculate this metric only at the last timepoint available in the data. Normally, RFV should be computed on a transactional level and updated after every transaction.

In [106]:
f'Profiling {len(users_pd)} users.. it may take some while'
Out[106]:
'Profiling 19430 users.. it may take some while'
In [112]:
transactions_pd['year-month'] = transactions_pd['created_date'].apply(lambda x: f'{x.year}-{x.month}')
In [113]:
# we're using a groupby approach
recency = transactions_pd.groupby('user_id').agg({'created_date': 'max'})  # recency
In [130]:
# recency
In [120]:
frequency = transactions_pd.groupby(['user_id', 'year-month'])['transaction_id'].count()  # frequency
In [151]:
# frequency
In [124]:
volume = transactions_pd.groupby('user_id')['transaction_id'].count()  # volume
In [156]:
# volume
In [127]:
# responsivity
last_notification = notifications_pd.groupby('user_id').agg({'created_date': 'max'})
In [128]:
users_enriched_pd = users_pd.copy()
In [144]:
users_enriched_pd['recency'] = users_enriched_pd['user_id'].apply(lambda x: (max_date - 
                                                                             recency.loc[x]['created_date']).days 
                                                                  if x in recency.index else None)
In [152]:
users_enriched_pd['frequency'] = users_enriched_pd['user_id'].apply(lambda x: frequency.loc[x].values.mean() 
                                                                    if x in frequency.index else None)
In [157]:
users_enriched_pd['volume'] = users_enriched_pd['user_id'].apply(lambda x: volume.loc[x] 
                                                                 if x in frequency.index else None)
In [169]:
users_enriched_pd['responsivity'] = users_enriched_pd['user_id'].apply(lambda x: (recency.loc[x]['created_date'] - 
                                                                                  last_notification.loc[x]['created_date']).days 
                                                                       if (x in recency.index) and (x in last_notification.index) else None)
In [172]:
users_enriched_pd['responsivity'] = users_enriched_pd['responsivity'].apply(lambda x: None if x < 0 else x)
# it may happen that a notification was sent recently, and the user hasn't done anything

Analyze type of distributions

In [173]:
sns.pairplot(users_enriched_pd[['recency', 'frequency', 'volume', 'responsivity', 'plan']].fillna(0), hue='plan')
Out[173]:
<seaborn.axisgrid.PairGrid at 0x14f263f60>

Check for Pareto principle

In [ ]:
 

Calculate Gini coefficient

In [ ]:
 

2. Using your logic from above, build a model (heuristic/statistical/ML) to classify 'engaged' and 'unengaged' users

Note that features which are directly correlated with your target metric could lead to overfitting

In [ ]:
 
In [ ]:
 

3. Let’s assume an ​unengaged​ user is a churned user. Now suppose we use your model to identify unengaged users and implement some business actions try to convert them to engaged users (commonly known as reducing churn)

How would you set up a test/experiment to check whether we are actually reducing churn?

In [ ]:
 

What metrics and techniques would you use to assess the impact of the business action?

In [ ]: